Categorical Model

Classification
Regression
GLM
Modeling a categorical dependent variable with more than two nominal outcomes.

General Principles

To model the relationship between one or more independent variables and an outcome variable in which each observation is a single choice from a set of more than two categories, we can use a Categorical model.
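
To make the data structure concrete, below is a minimal, hypothetical illustration of the kind of data such a model handles: a nominal outcome with three levels and one continuous predictor. The column names and values are made up for illustration and do not come from the simulated dataset used later.

import pandas as pd

# Hypothetical data: each row records one observation's category (y, one of three
# nominal levels) together with a continuous predictor (x)
df = pd.DataFrame({
    "y": ["red", "blue", "green", "blue", "red", "green"],
    "x": [1.2, 3.5, 2.1, 4.0, 0.8, 2.7],
})
print(df["y"].value_counts())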

Considerations

Caution
  • We have the same considerations as for Regression with a continuous outcome variable.

  • One way to interpret a Categorical model is that we build K - 1 linear models, where K is the number of categories. Once we have the linear prediction for each category, we convert these predictions to probabilities on a simplex by passing them through the softmax function (see the "jax.nn.softmax" line in the code, and the sketch after this list).

  • Each intercept \alpha_k captures the baseline log-odds of its category relative to the reference category; thus, different categories need different intercepts.

  • On the other hand, as we assume that the effect of each predictor on the outcome is consistent across all categories, the regression coefficients \beta are shared across categories.

  • The relationship between the predictor variables and the log-odds of each category is modeled linearly, allowing us to interpret the effect of each predictor on the log-odds of each category.
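
As a minimal sketch of the points above, here is how K - 1 linear predictors plus a pivot fixed at zero are converted to a simplex with the softmax function. The intercepts, coefficient, and predictor values are made up for illustration.

import jax
import jax.numpy as jnp

# Hypothetical linear predictions for K = 3 categories
alpha = jnp.array([-2.0, -1.5])   # one intercept per non-reference category
beta = 0.05                       # coefficient shared across categories
income = jnp.array([1.0, 2.0])    # predictor value associated with each category

phi = jnp.array([alpha[0] + beta * income[0],
                 alpha[1] + beta * income[1],
                 0.0])            # pivot (reference) category fixed at 0

theta = jax.nn.softmax(phi)       # probabilities: non-negative and summing to 1
print(theta, theta.sum())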

Example

Below is an example code snippet demonstrating a Bayesian Categorical model using the Bayesian Inference (BI) package. This example is based on McElreath (2018).

Code

Python
from BI import bi, jnp
import jax
# Setup device ------------------------------------------------
m = bi('cpu')

# Import Data & Data Manipulation ------------------------------------------------
# Import
data_path = m.load.sim_multinomial(only_path=True)
m.data(data_path, sep=',') 

# Define model ------------------------------------------------
def model(career, income):
    # Priors: one intercept per non-reference category, one shared slope
    a = m.dist.normal(0, 1, shape=(2,), name='a')
    b = m.dist.half_normal(0.5, shape=(1,), name='b')
    # Linear predictors for the two non-reference categories
    s_1 = a[0] + b * income[0]
    s_2 = a[1] + b * income[1]
    # Pivot (reference) category: linear predictor fixed at 0
    s_3 = jnp.zeros(1)
    # Convert the linear predictors to probabilities on a simplex
    p = jax.nn.softmax(jnp.stack([s_1[0], s_2[0], s_3[0]]))
    # Likelihood
    m.dist.categorical(probs=p, obs=career)

# Run sampler ------------------------------------------------ 
m.fit(model)  # Estimate the posterior via MCMC sampling

# Summary ------------------------------------------------
m.summary() # Get posterior distributions
        mean    sd  hdi_5.5%  hdi_94.5%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
a[0]   -2.09  0.27     -2.43      -1.67       0.03     0.03     85.36     78.74    NaN
a[1]   -1.56  0.16     -1.80      -1.30       0.02     0.01    101.63     28.52    NaN
b[0]    0.05  0.05      0.00       0.12       0.01     0.01     66.53     81.49    NaN
R

library(BayesianInference)
jax = reticulate::import('jax')
jnp = jax$numpy

# setup platform------------------------------------------------
m=importBI(platform='cpu')

# import data ------------------------------------------------
m$data(m$load$sim_multinomial(only_path=T), sep=',')
keys <- c("income", "career")
income = unique(m$df$income)
income = income[order(income)]
values <- list(jnp$array(as.integer(income)), jnp$array(as.integer(m$df$career)))
m$data_on_model = py_dict(keys, values, convert = TRUE)

# Define model ------------------------------------------------
model <- function(income, career){
  # Parameter prior distributions
  alpha = bi.dist.normal(0, 1, name='alpha', shape = c(2))
  beta = bi.dist.half_normal(0.5, name='beta')
  
  s_1 = alpha[0] + beta * income[0]
  s_2 = alpha[1] + beta * income[1]
  s_3 = 0 # reference category

  p = jax$nn$softmax(jnp$stack(list(s_1, s_2, s_3)))

  # Likelihood
  m$dist$categorical(probs=p, obs=career)
}

# Run MCMC ------------------------------------------------
m$fit(model) # Estimate the posterior via MCMC sampling

# Summary ------------------------------------------------
m$summary() # Get posterior distribution
Julia

using BayesianInference

# Setup device------------------------------------------------
m = importBI(platform="cpu")

# Import Data & Data Manipulation ------------------------------------------------
# Import
data_path = m.load.sim_multinomial(only_path = true)
m.data(data_path, sep=',')

# Define model ------------------------------------------------
@BI function model(career, income)
    # Priors: one intercept per non-reference category, one shared slope
    a = m.dist.normal(0, 1, shape=(2,), name = "a")
    b = m.dist.half_normal(0.5, shape=(1,), name = "b")

    # a and income are Python (JAX) objects, so 0-based [0]/[1] indexing applies
    s_1 = a[0] + b * income[0]
    s_2 = a[1] + b * income[1]

    # Pivot (reference) category: jnp.array makes s_3 a Python object, so s_3[0]
    # calls Python's 0-based __getitem__ rather than Julia's 1-based indexing
    s_3 = jnp.array([0.0])

    # Convert the linear predictors to probabilities on a simplex
    p = jax.nn.softmax(jnp.stack([s_1[0], s_2[0], s_3[0]]))

    # Likelihood
    m.dist.categorical(probs=p, obs=career)
end

# Run mcmc ------------------------------------------------
m.fit(model)  # Estimate the posterior via MCMC sampling

# Summary ------------------------------------------------
m.summary() # Get posterior distributions

Mathematical Details

We model the outcome using a Categorical distribution, the single-trial special case of the multinomial distribution (which models the counts of outcomes falling into different categories). For an outcome variable Y with K categories, the likelihood is:

Y_i \sim \text{Categorical}(\theta_i) \\
\theta_i = \text{Softmax}(\phi_i) \\
\phi_{[i,1]} = \alpha_1 + \beta_1 X_i \\
\phi_{[i,2]} = \alpha_2 + \beta_2 X_i \\
\vdots \\
\phi_{[i,K]} = 0 \\
\alpha_{k} \sim \text{Normal}(0, 1) \\
\beta_{k} \sim \text{HalfNormal}(0.5)

Where:

  • Y_i is the dependent categorical variable for observation i indicating the category of the observation.

  • \theta_i is a probability vector for observation i; its k-th element gives the probability that observation i falls in category k.

  • \phi_i gives the linear predictor for each of the K categories. We apply the softmax function so that the probabilities \theta_i form a simplex (non-negative and summing to one); a short worked derivation follows this list.

  • Each element of \phi_i is obtained by applying a linear regression model with its own respective intercept \alpha_k and slope coefficient \beta_k. To ensure the model is identifiable, one category, K, is arbitrarily chosen as a reference or baseline category. The linear predictor for this reference category is set to zero. The coefficients for the other categories then represent the change in the log-odds of being in that category versus the reference category.
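
As a worked step connecting the softmax to the log-odds interpretation above, using only this section's definitions:

\theta_{[i,k]} = \frac{\exp(\phi_{[i,k]})}{\sum_{j=1}^{K} \exp(\phi_{[i,j]})} \\
\log \frac{\theta_{[i,k]}}{\theta_{[i,K]}} = \phi_{[i,k]} - \phi_{[i,K]} = \alpha_k + \beta_k X_i

Because \phi_{[i,K]} = 0, each \alpha_k and \beta_k can be read directly as the baseline log-odds and the change in log-odds of category k relative to the reference category K.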

Reference(s)

McElreath, Richard. 2018. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Chapman & Hall/CRC.